{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's first check that all our variants identified to be in calls of two different tools are indeed there:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/\n",
    "for id in `grep -v '#' snv_arm/10033_trio1_gatkANDdng.vcf | awk 'BEGIN {FS=\"\\t\"}; {print $2}' -`; do\n",
    "    if ! grep -q $id dng/10033_trio1_dnm.vcf || ! grep -q $id gatk_refine/10033_trio1_hiConfDeNovo.vcf; then\n",
    "        echo $id;\n",
    "    fi;\n",
    "done"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so let's assume this also works for the other trios. Now, let's figure out how many interesting DNVs showed up both in GATK and DNG:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "4\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=gatkANDdng.vcf\n",
    "cd /data/NCR_SBRB/simplex/snv_arm\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so now interesting_snvs.txt hold all DNVs that occur in affected kids but not in the unaffected sibs, called by both GATK and DenovoGear. If we count how often each variant appears, we get how many families:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>snp</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>chr1 91852821</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>chr1 142726878</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>chr1 142825133</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>chr1 142825147</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>chr1 142825151</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              snp\n",
       "0   chr1 91852821\n",
       "1  chr1 142726878\n",
       "2  chr1 142825133\n",
       "3  chr1 142825147\n",
       "4  chr1 142825151"
      ]
     },
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "snps_gatkANDdng = pd.read_table('/data/NCR_SBRB/simplex/snv_arm/interesting_snvs_gatkANDdng.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "snps_gatkANDdng.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 2 in 536 unique snps\n",
      "[['chr17 72762902' 2]\n",
      " ['chr18 15271172' 2]\n",
      " ['chr2 89105006' 2]\n",
      " ['chr20 29637674' 2]\n",
      " ['chr20 29637691' 2]\n",
      " ['chr6 32548712' 2]\n",
      " ['chr6 32557647' 2]]\n"
     ]
    }
   ],
   "source": [
    "from scipy import stats\n",
    "counts_gatkANDdng = stats.itemfreq(snps_gatkANDdng['snp'])\n",
    "my_max = np.max(counts_gatkANDdng[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatkANDdng))\n",
    "print counts_gatkANDdng[counts_gatkANDdng[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can do the same thing for the combination of GATK and triodenovo:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "3\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=gatkANDtdn.vcf\n",
    "cd /data/NCR_SBRB/simplex/snv_arm\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 2 in 631 unique snps\n",
      "[['chr17 72762902' 2]\n",
      " ['chr18 15271172' 2]\n",
      " ['chr2 89105006' 2]\n",
      " ['chr20 29637674' 2]\n",
      " ['chr20 29637691' 2]\n",
      " ['chr6 29968761' 2]\n",
      " ['chr6 32548712' 2]\n",
      " ['chr6 32557647' 2]\n",
      " ['chr7 154002420' 2]]\n"
     ]
    }
   ],
   "source": [
    "snps_gatkANDtdn = pd.read_table('/data/NCR_SBRB/simplex/snv_arm/interesting_snvs_gatkANDtdn.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_gatkANDtdn = stats.itemfreq(snps_gatkANDtdn['snp'])\n",
    "my_max = np.max(counts_gatkANDtdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatkANDtdn))\n",
    "print counts_gatkANDtdn[counts_gatkANDtdn[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Good... some intersection too. What if we try the intersection of all 3?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "3\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=ensemble.vcf\n",
    "cd /data/NCR_SBRB/simplex/snv_arm\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 98,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 2 in 535 unique snps\n",
      "[['chr17 72762902' 2]\n",
      " ['chr18 15271172' 2]\n",
      " ['chr2 89105006' 2]\n",
      " ['chr20 29637674' 2]\n",
      " ['chr20 29637691' 2]\n",
      " ['chr6 32548712' 2]\n",
      " ['chr6 32557647' 2]]\n"
     ]
    }
   ],
   "source": [
    "snps = pd.read_table('/data/NCR_SBRB/simplex/snv_arm/interesting_snvs_ensemble.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts = stats.itemfreq(snps['snp'])\n",
    "my_max = np.max(counts[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts))\n",
    "print counts[counts[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then, the question is: out of 535 unique SNPs, if I were to choose 20 with replacement (as one of the families only has trio1, and therefore didn't get counted), how often does the same SNP get selected twice just by chance? "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "def do_perms(snps, noften, nboot=10000, npicks=20):\n",
    "    success = 0\n",
    "    for i in range(nboot):\n",
    "        picks = np.random.choice(snps, npicks, replace = True)\n",
    "        counts = stats.itemfreq(picks)\n",
    "        nmax = np.max(counts[:,1])\n",
    "        if (nmax >= noften):\n",
    "            success += 1\n",
    "    return(success/float(nboot))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 86,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.3053"
      ]
     },
     "execution_count": 86,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "do_perms(counts[:, 0], 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not good... well, do any of those IDs come up in the affected trio I'm not counting?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 93,
   "metadata": {},
   "outputs": [],
   "source": [
    "fid = open('/home/sudregp/tmp/good_snps.txt', 'w')\n",
    "idx = counts[:, 1] == my_max\n",
    "for rsid in counts[idx, 0]:\n",
    "    fid.write('%s\\n' % rsid.split(' ')[1])\n",
    "fid.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "7 /home/sudregp/tmp/good_snps.txt\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo `wc -l /home/sudregp/tmp/good_snps.txt`\n",
    "grep -f /home/sudregp/tmp/good_snps.txt /data/NCR_SBRB/simplex/snv_arm/10042_trio1_ensemble.vcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Not realy... oh well. Our options here are:\n",
    "* do this within tool, with the hopes of making the stats better\n",
    "* look at all DNVs in affected siblings, and check that the more frequent ones are not as frequent in unaffected sibs (no within-family requirement)\n",
    "* drop the initial quality filter in GATK pipeline?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Within tools stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "4\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "rm: cannot remove `interesting_snvs_hiConfDeNovo.vcf.txt': No such file or directory\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=hiConfDeNovo.vcf \n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 99,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 2 in 3388 unique snps\n",
      "[['chr1 40229504' 2]\n",
      " ['chr17 72762902' 2]\n",
      " ['chr18 15271172' 2]\n",
      " ['chr2 89105006' 2]\n",
      " ['chr20 29637674' 2]\n",
      " ['chr20 29637691' 2]\n",
      " ['chr6 29968761' 2]\n",
      " ['chr6 32548712' 2]\n",
      " ['chr6 32557647' 2]\n",
      " ['chr7 154002420' 2]]\n"
     ]
    }
   ],
   "source": [
    "snps_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/interesting_snvs_hiConfDeNovo.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_gatk = stats.itemfreq(snps_gatk['snp'])\n",
    "my_max = np.max(counts_gatk[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_gatk))\n",
    "print counts_gatk[counts_gatk[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0578"
      ]
     },
     "execution_count": 102,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "do_perms(counts_gatk[:, 0], 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is looking better, as we have increased the options of unique SNPs. Not great yet, but getting there."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Triodenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "4\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "rm: cannot remove `interesting_snvs_denovo_v2.vcf.txt': No such file or directory\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=denovo_v2.vcf \n",
    "cd /data/NCR_SBRB/simplex/triodenovo\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 5 in 18802 unique snps\n",
      "[['chr1 16914580' 5]\n",
      " ['chr10 49319323' 5]\n",
      " ['chr12 69667893' 5]\n",
      " ['chrX 102973509' 5]\n",
      " ['chrX 118751076' 5]\n",
      " ['chrX 13397236' 5]\n",
      " ['chrX 135307049' 5]\n",
      " ['chrX 14599572' 5]\n",
      " ['chrX 152610294' 5]\n",
      " ['chrX 153008911' 5]\n",
      " ['chrX 153880181' 5]\n",
      " ['chrX 153904473' 5]\n",
      " ['chrX 154774663' 5]\n",
      " ['chrX 16876980' 5]\n",
      " ['chrX 38262808' 5]\n",
      " ['chrX 43601142' 5]\n",
      " ['chrX 96502650' 5]]\n"
     ]
    }
   ],
   "source": [
    "snps_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/interesting_snvs_denovo_v2.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_tdn = stats.itemfreq(snps_tdn['snp'])\n",
    "my_max = np.max(counts_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_tdn))\n",
    "print counts_tdn[counts_tdn[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.0"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "do_perms(counts_tdn[:, 0], 5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is much better, with 5 trios showing the variant. I'd ignore the chrX variants for now. Somewhat concerning that the GATK refinement pipeline didn't show those 3 other variants... maybe a function of the QC?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DenovoGear"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "10033\n",
      "2\n",
      "10042\n",
      "1\n",
      "10090\n",
      "2\n",
      "10094\n",
      "2\n",
      "10128\n",
      "2\n",
      "10131\n",
      "4\n",
      "10153\n",
      "3\n",
      "10164\n",
      "2\n",
      "10173\n",
      "2\n",
      "10178\n",
      "2\n",
      "10182\n",
      "3\n",
      "10197\n",
      "2\n",
      "10215\n",
      "4\n",
      "10369\n",
      "2\n",
      "10406\n",
      "3\n",
      "10448\n",
      "2\n",
      "1892\n",
      "2\n",
      "1893\n",
      "2\n",
      "1895\n",
      "2\n",
      "1976\n",
      "3\n",
      "855\n",
      "2\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "rm: cannot remove `interesting_snvs_dnm.vcf.txt': No such file or directory\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "suffix=dnm.vcf \n",
    "cd /data/NCR_SBRB/simplex/dng\n",
    "rm interesting_snvs_${suffix}.txt\n",
    "# figure out all family IDs\n",
    "ls -1 *_trio1_${suffix} > famids.txt\n",
    "sed -i -e \"s/_trio1_${suffix}//g\" famids.txt\n",
    "\n",
    "# for each family ID\n",
    "while read fam; do\n",
    "  # figure out how many trios we have\n",
    "   ntrios=`ls -1 ${fam}_trio?_${suffix} | wc -l`;\n",
    "   ntrios=$(($ntrios))\n",
    "   echo $fam;\n",
    "   echo $ntrios\n",
    "   # if we have more than one (assuming the first one is affected)\n",
    "   if [ $ntrios -gt 1 ]; then\n",
    "      # get all SNVs in the affected trio in the family\n",
    "      cut -f 1,2 ${fam}_trio1_${suffix} | grep -v '#' - > ${fam}_possible_snvs_${suffix}.txt;\n",
    "      # combine the vcf files of all unnafected trios\n",
    "      cat ${fam}_trio[2..$ntrios]_${suffix} > ${fam}_control_snvs_${suffix}.txt;\n",
    "      # for each possible SNV in affected trio, mark it as interesting if it's not\n",
    "      # in the unnafected trios\n",
    "      while read snv; do\n",
    "         if ! grep -q \"$snv\" ${fam}_control_snvs_${suffix}.txt; then\n",
    "            echo $snv >> interesting_snvs_${suffix}.txt;\n",
    "         fi;\n",
    "      done < ${fam}_possible_snvs_${suffix}.txt;\n",
    "   fi;\n",
    "done < famids.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 6 in 20901 unique snps\n",
      "[['chr2 89072915' 6]\n",
      " ['chr8 97156476' 6]]\n"
     ]
    }
   ],
   "source": [
    "snps_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/interesting_snvs_dnm.vcf.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_dng = stats.itemfreq(snps_dng['snp'])\n",
    "my_max = np.max(counts_dng[:, 1])\n",
    "print 'Maximum frequency: %d in %d unique snps' % (my_max, len(counts_dng))\n",
    "print counts_dng[counts_dng[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, 6 families is better than 5 (or 2). But the other tools didn't pick that one up. Still worth analyzing it..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we continue, do any of these come up in the family without unaffected data?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "grep 89072915 /data/NCR_SBRB/simplex/dng/10042_trio1_dnm.vcf\n",
    "grep 97156476 /data/NCR_SBRB/simplex/dng/10042_trio1_dnm.vcf\n",
    "grep 16914580 /data/NCR_SBRB/simplex/triodenovo/10042_trio1_denovo_v2.vcf\n",
    "grep 49319323 /data/NCR_SBRB/simplex/triodenovo/10042_trio1_denovo_v2.vcf\n",
    "grep 69667893 /data/NCR_SBRB/simplex/triodenovo/10042_trio1_denovo_v2.vcf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Nothing... oh well. We can still look for biological plausibility on them, just for kicks."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Within group stats\n",
    "\n",
    "The approach here is to calculate the best stats in all ADHD samples, and see what's the best we can do for a specific variable in non-affected siblings. For comparison, we can do it the other way around as well.\n",
    "\n",
    "## Triodenovo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/triodenovo/\n",
    "\n",
    "# concatenate all affected trios and extract the SNPs\n",
    "cat *_trio1_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > affected_snvs.txt;\n",
    "cat *_trio[2..4]_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > unaffected_snvs.txt;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 28 in 21 affected families\n",
      "[['chrX:41093413' 28]]\n",
      "Maximum frequency: 24 in 29 unaffected families\n",
      "[['chrX:9686187' 24]]\n"
     ]
    }
   ],
   "source": [
    "naff = 21\n",
    "nunaff = 29\n",
    "aff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_tdn = stats.itemfreq(aff_tdn['snp'])\n",
    "my_max = np.max(counts_aff_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "print counts_aff_tdn[counts_aff_tdn[:, 1] == my_max]\n",
    "\n",
    "unaff_tdn = pd.read_table('/data/NCR_SBRB/simplex/triodenovo/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_tdn = stats.itemfreq(unaff_tdn['snp'])\n",
    "my_max = np.max(counts_unaff_tdn[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "print counts_unaff_tdn[counts_unaff_tdn[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*WEIRD!!! DUPLICATE RSIDS IN VCF FILES? HOWNCOME MORE OCCURRENCES THAN FAMILIES?*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DenovoGear"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/dng/\n",
    "\n",
    "# concatenate all affected trios and extract the SNPs\n",
    "cat *_trio1_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > affected_snvs.txt;\n",
    "cat *_trio[2..4]_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > unaffected_snvs.txt;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 8 in 21 affected families\n",
      "[['chr17:21319860' 8]\n",
      " ['chrX:8432783' 8]]\n",
      "Maximum frequency: 7 in 29 unaffected families\n",
      "[['chr4:113190251' 7]\n",
      " ['chrX:13779124' 7]\n",
      " ['chrX:152772473' 7]\n",
      " ['chrX:8432783' 7]\n",
      " ['chrY:9967496' 7]]\n"
     ]
    }
   ],
   "source": [
    "naff = 21\n",
    "nunaff = 29\n",
    "aff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_dng = stats.itemfreq(aff_dng['snp'])\n",
    "my_max = np.max(counts_aff_dng[:, 1])\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "print counts_aff_dng[counts_aff_dng[:, 1] == my_max]\n",
    "\n",
    "unaff_dng = pd.read_table('/data/NCR_SBRB/simplex/dng/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_dng = stats.itemfreq(unaff_dng['snp'])\n",
    "my_max = np.max(counts_unaff_dng[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "print counts_unaff_dng[counts_unaff_dng[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## GATK"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "\n",
    "# concatenate all affected trios and extract the SNPs\n",
    "cat *_trio1_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > affected_snvs.txt;\n",
    "cat *_trio[2..4]_*.vcf | grep -v '#' - | awk 'BEGIN {FS=\"\\t\"; OFS=\":\"}; {print $1, $2}' - > unaffected_snvs.txt;"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Maximum frequency: 2 in 21 affected families\n",
      "[['chr17:72762902' 2]\n",
      " ['chr18:15271172' 2]\n",
      " ['chr1:40229504' 2]\n",
      " ['chr20:29637674' 2]\n",
      " ['chr20:29637691' 2]\n",
      " ['chr2:89105006' 2]\n",
      " ['chr6:29968761' 2]\n",
      " ['chr6:32548712' 2]\n",
      " ['chr6:32557647' 2]\n",
      " ['chr6:32725367' 2]\n",
      " ['chr7:154002420' 2]\n",
      " ['chr7:154467985' 2]]\n",
      "Maximum frequency: 2 in 29 unaffected families\n",
      "[['chr12:11214231' 2]\n",
      " ['chr12:11214232' 2]\n",
      " ['chr12:50745851' 2]\n",
      " ['chr12:50745893' 2]\n",
      " ['chr12:50745894' 2]\n",
      " ['chr12:99139159' 2]\n",
      " ['chr16:33940111' 2]\n",
      " ['chr16:33940122' 2]\n",
      " ['chr1:142813302' 2]\n",
      " ['chr1:148902738' 2]\n",
      " ['chr1:206566826' 2]\n",
      " ['chr20:29638202' 2]\n",
      " ['chr2:9546136' 2]\n",
      " ['chr6:32083111' 2]\n",
      " ['chrX:116025284' 2]]\n"
     ]
    }
   ],
   "source": [
    "naff = 21\n",
    "nunaff = 29\n",
    "aff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/affected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_aff_gatk = stats.itemfreq(aff_gatk['snp'])\n",
    "my_max = np.max(counts_aff_gatk[:, 1])\n",
    "print 'Maximum frequency: %d in %d affected families' % (my_max, naff)\n",
    "print counts_aff_gatk[counts_aff_gatk[:, 1] == my_max]\n",
    "\n",
    "unaff_gatk = pd.read_table('/data/NCR_SBRB/simplex/gatk_refine/unaffected_snvs.txt',\n",
    "                               header=None, names=['snp'])\n",
    "counts_unaff_gatk = stats.itemfreq(unaff_gatk['snp'])\n",
    "my_max = np.max(counts_unaff_gatk[:, 1])\n",
    "print 'Maximum frequency: %d in %d unaffected families' % (my_max, nunaff)\n",
    "print counts_unaff_gatk[counts_unaff_gatk[:, 1] == my_max]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "head: cannot open `affected_snvs_.txt' for reading: No such file or directory\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "cd /data/NCR_SBRB/simplex/gatk_refine\n",
    "head affected_snvs.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "640 /data/NCR_SBRB/simplex/snv_arm/interesting_snvs.txt\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "wc -l /data/NCR_SBRB/simplex/snv_arm/interesting_snvs.txt"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:py2.7.10]",
   "language": "python",
   "name": "conda-env-py2.7.10-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
